To address the large modality gap between images in cross-modal person re-identification, most existing methods rely on pixel alignment and feature alignment to realize image matching. To further improve the accuracy of matching images across the two modalities, a multi-input dual-stream network model based on a dynamic dual-attention mechanism was designed. Firstly, images of the same person captured by different cameras were added to each training batch, enabling the neural network to learn sufficient feature information from a limited number of samples. Secondly, the gray-scale image obtained by homogeneous augmentation was used as an intermediate bridge that retains the structural information of the visible-light image while eliminating its color information. The use of gray-scale images weakened the network's dependence on color, thereby strengthening the model's ability to mine structural information. Finally, a Weighted Six-Directional triplet Ranking (WSDR) loss suitable for images of three modalities was proposed, which makes full use of cross-modal triplet relationships under different viewpoints, optimizes the relative distances between features of multiple modalities, and improves robustness to modality changes. Experimental results on the SYSU-MM01 dataset show that, compared with the Dynamic Dual-attentive AGgregation (DDAG) learning model, the proposed model improves the Rank-1 and mean Average Precision (mAP) evaluation metrics by 4.66 and 3.41 percentage points respectively.
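The gray-scale bridge described above can be sketched as a simple channel-wise conversion. This is a minimal illustration, not the paper's exact homogeneous-augmentation procedure: the function name `to_grayscale` and the use of the common ITU-R BT.601 luma weights are assumptions.

```python
def to_grayscale(image):
    """Convert an RGB image (nested lists of (R, G, B) tuples, values 0-255)
    into a 3-channel gray-scale image using ITU-R BT.601 luma weights.
    Color information is discarded while per-pixel structure is preserved."""
    gray = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            # Replicate the luma into three channels so the gray image can be
            # fed through the same stream as the visible-light images.
            gray_row.append((y, y, y))
        gray.append(gray_row)
    return gray

# A pure-red and a pure-green pixel differ only in color; after conversion
# only an intensity (structural) difference remains.
img = [[(255, 0, 0), (0, 255, 0)]]
print(to_grayscale(img))  # → [[(76, 76, 76), (150, 150, 150)]]
```

Because the gray-scale copy shares the identity and structure of the visible image but none of its color, matching it against infrared features forces the network toward color-invariant cues.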
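The "six directions" in the WSDR loss can be read as the six ordered pairs among the three modalities (visible V, gray G, infrared I): the anchor is drawn from one modality and the positive/negative pair from another. The sketch below is schematic only; the hinge form, the uniform default weights, and the function names are assumptions, and the paper's actual weighting and sample mining may differ.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_hinge(anchor, positive, negative, margin=0.3):
    # Standard hinge triplet: the cross-modal positive must be closer to the
    # anchor than the cross-modal negative by at least `margin`.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def wsdr_loss(feats_a, feats_b, weights=None, margin=0.3):
    """Schematic weighted six-directional triplet ranking loss.
    feats_a / feats_b: per-modality features of two DIFFERENT identities,
    e.g. {'V': [...], 'G': [...], 'I': [...]}. For each ordered modality
    pair (m1, m2) the anchor comes from identity A in m1, the positive from
    identity A in m2, and the negative from identity B in m2 -- giving six
    cross-modal ranking directions, each with its own weight."""
    modalities = list(feats_a)
    directions = [(m1, m2) for m1 in modalities for m2 in modalities if m1 != m2]
    if weights is None:
        # Assumed default: uniform weights over the six directions.
        weights = {d: 1.0 / len(directions) for d in directions}
    total = 0.0
    for m1, m2 in directions:
        total += weights[(m1, m2)] * triplet_hinge(
            feats_a[m1], feats_a[m2], feats_b[m2], margin)
    return total

# Well-separated identities incur no loss; collapsed ones pay the margin.
a = {'V': [0.0, 0.0], 'G': [0.0, 0.0], 'I': [0.0, 0.0]}
b = {'V': [10.0, 0.0], 'G': [10.0, 0.0], 'I': [10.0, 0.0]}
print(wsdr_loss(a, b))  # → 0.0
```

Optimizing all six directions jointly, rather than a single visible-to-infrared direction, is what lets the loss constrain the relative distances among all three modalities at once.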